The data set analyzed here is available on the Kaggle platform. It is a collection of the largest YouTube channels and offers a good basis for analyzing the platform's top creators, with details on subscriber counts, video views, upload frequency, country of origin, earnings, and more. The original data source is Global YouTube Statistics.
Load the required libraries.
library(ggplot2)
library(ggthemes)
library(dplyr)
library(shiny)
library(maps)
library(hexbin)
library(countrycode)
library(rworldmap)
library(leaflet)
library(plotly)
library(janitor)
Set up a consistent plot format.
plot_theme <- theme_few() +
  theme(plot.title = element_text(color = "darkred", hjust = 0.5)) +
  theme(strip.text.x = element_text(size = 14, colour = "#202020"))
Then read the data with the read.csv function.
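The reading step itself is not shown in the report. As a minimal sketch, the snippet below writes a tiny stand-in CSV to a temporary file and reads it back with read.csv. The header strings and the two data rows are assumptions made for illustration, not the real Kaggle file. It also suggests where dotted column names such as video.views may come from: by default read.csv turns characters that are not valid in R names into dots.

```r
# Write a tiny stand-in CSV (hypothetical headers and values) to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  'rank,Youtuber,video views,"Gross tertiary education enrollment (%)"',
  '1,T-Series,228000000000,28.1',
  '2,YouTube Movies,0,88.2'
), csv_path)

# read.csv applies check.names = TRUE by default, so spaces, brackets
# and percent signs in the header are replaced by dots
youtube_demo <- read.csv(csv_path)
str(youtube_demo)
names(youtube_demo)
```

For the real analysis, csv_path would point to the file downloaded from Kaggle.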
Look at the structure of the data.
## 'data.frame': 995 obs. of 28 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Youtuber : chr "T-Series" "YouTube Movies" "MrBeast" "Cocomelon - Nursery Rhymes" ...
## $ subscribers : int 245000000 170000000 166000000 162000000 159000000 119000000 112000000 111000000 106000000 98900000 ...
## $ video.views : num 2.28e+11 0.00 2.84e+10 1.64e+11 1.48e+11 ...
## $ category : chr "Music" "Film & Animation" "Entertainment" "Education" ...
## $ Title : chr "T-Series" "youtubemovies" "MrBeast" "Cocomelon - Nursery Rhymes" ...
## $ uploads : int 20082 1 741 966 116536 0 1111 4716 493 574 ...
## $ Country : chr "India" "United States" "United States" "United States" ...
## $ Abbreviation : chr "IN" "US" "US" "US" ...
## $ channel_type : chr "Music" "Games" "Entertainment" "Education" ...
## $ video_views_rank : int 1 4055159 48 2 3 4057944 5 44 630 8 ...
## $ country_rank : num 1 7670 1 2 2 NaN 3 1 5 5 ...
## $ channel_type_rank : num 1 7423 1 1 2 ...
## $ video_views_for_the_last_30_days : num 2.26e+09 1.20e+01 1.35e+09 1.98e+09 1.82e+09 ...
## $ lowest_monthly_earnings : num 564600 0 337000 493800 455900 ...
## $ highest_monthly_earnings : num 9.0e+06 5.0e-02 5.4e+06 7.9e+06 7.3e+06 ...
## $ lowest_yearly_earnings : num 6.8e+06 4.0e-02 4.0e+06 5.9e+06 5.5e+06 ...
## $ highest_yearly_earnings : num 1.08e+08 5.80e-01 6.47e+07 9.48e+07 8.75e+07 ...
## $ subscribers_for_last_30_days : num 2e+06 NaN 8e+06 1e+06 1e+06 NaN NaN NaN 1e+05 6e+05 ...
## $ created_year : num 2006 2006 2012 2006 2006 ...
## $ created_month : chr "Mar" "Mar" "Feb" "Sep" ...
## $ created_date : num 13 5 20 1 20 24 12 29 14 23 ...
## $ Gross.tertiary.education.enrollment....: num 28.1 88.2 88.2 88.2 28.1 NaN 88.2 63.2 81.9 88.2 ...
## $ Population : num 1.37e+09 3.28e+08 3.28e+08 3.28e+08 1.37e+09 ...
## $ Unemployment.rate : num 5.36 14.7 14.7 14.7 5.36 NaN 14.7 2.29 4.59 14.7 ...
## $ Urban_population : num 4.71e+08 2.71e+08 2.71e+08 2.71e+08 4.71e+08 ...
## $ Latitude : num 20.6 37.1 37.1 37.1 20.6 ...
## $ Longitude : num 79 -95.7 -95.7 -95.7 79 ...
The dataset is a data frame with 995 observations and 28 variables. The structure output shows three storage types: character, integer, and numeric. There are 7 character variables, of which “Youtuber” and “category” have many distinct values, reflecting the many different YouTubers and video categories. There are 4 integer variables holding count-like information, such as “rank” and “subscribers”. The remaining 17 numeric variables hold statistical and quantitative data, such as “video.views”, “lowest_monthly_earnings” and “Population”.
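The counts per storage type can be checked in one line instead of reading them off the structure output. The sketch below uses a small made-up data frame (demo is a hypothetical name); the same call on the full data would give the counts quoted above.

```r
# Toy data frame with one integer, one character and one numeric column
demo <- data.frame(
  rank = 1:3,
  Youtuber = c("T-Series", "YouTube Movies", "MrBeast"),
  video.views = c(2.28e11, 0, 2.84e10),
  stringsAsFactors = FALSE
)

# Tabulate the class of every column
class_counts <- table(sapply(demo, class))
class_counts
```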
Use the summary function to obtain descriptive statistics.
## rank Youtuber subscribers video.views
## Min. : 1.0 Length:995 Min. : 12300000 Min. :0.000e+00
## 1st Qu.:249.5 Class :character 1st Qu.: 14500000 1st Qu.:4.288e+09
## Median :498.0 Mode :character Median : 17700000 Median :7.761e+09
## Mean :498.0 Mean : 22982412 Mean :1.104e+10
## 3rd Qu.:746.5 3rd Qu.: 24600000 3rd Qu.:1.355e+10
## Max. :995.0 Max. :245000000 Max. :2.280e+11
##
## category Title uploads Country
## Length:995 Length:995 Min. : 0.0 Length:995
## Class :character Class :character 1st Qu.: 194.5 Class :character
## Mode :character Mode :character Median : 729.0 Mode :character
## Mean : 9187.1
## 3rd Qu.: 2667.5
## Max. :301308.0
##
## Abbreviation channel_type video_views_rank country_rank
## Length:995 Length:995 Min. : 1 Min. : 1.0
## Class :character Class :character 1st Qu.: 323 1st Qu.: 11.0
## Mode :character Mode :character Median : 916 Median : 51.0
## Mean : 554249 Mean : 386.1
## 3rd Qu.: 3584 3rd Qu.: 123.0
## Max. :4057944 Max. :7741.0
## NA's :1 NA's :116
## channel_type_rank video_views_for_the_last_30_days lowest_monthly_earnings
## Min. : 1.0 Min. :1.000e+00 Min. : 0
## 1st Qu.: 27.0 1st Qu.:2.014e+07 1st Qu.: 2700
## Median : 65.5 Median :6.408e+07 Median : 13300
## Mean : 745.7 Mean :1.756e+08 Mean : 36886
## 3rd Qu.: 139.8 3rd Qu.:1.688e+08 3rd Qu.: 37900
## Max. :7741.0 Max. :6.589e+09 Max. :850900
## NA's :33 NA's :56
## highest_monthly_earnings lowest_yearly_earnings highest_yearly_earnings
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 43500 1st Qu.: 32650 1st Qu.: 521750
## Median : 212700 Median : 159500 Median : 2600000
## Mean : 589808 Mean : 442257 Mean : 7081814
## 3rd Qu.: 606800 3rd Qu.: 455100 3rd Qu.: 7300000
## Max. :13600000 Max. :10200000 Max. :163400000
##
## subscribers_for_last_30_days created_year created_month created_date
## Min. : 1 Min. :1970 Length:995 Min. : 1.00
## 1st Qu.: 100000 1st Qu.:2009 Class :character 1st Qu.: 8.00
## Median : 200000 Median :2013 Mode :character Median :16.00
## Mean : 349079 Mean :2013 Mean :15.75
## 3rd Qu.: 400000 3rd Qu.:2016 3rd Qu.:23.00
## Max. :8000000 Max. :2022 Max. :31.00
## NA's :337 NA's :5 NA's :5
## Gross.tertiary.education.enrollment.... Population Unemployment.rate
## Min. : 7.60 Min. :2.025e+05 Min. : 0.750
## 1st Qu.: 36.30 1st Qu.:8.336e+07 1st Qu.: 5.270
## Median : 68.00 Median :3.282e+08 Median : 9.365
## Mean : 63.63 Mean :4.304e+08 Mean : 9.279
## 3rd Qu.: 88.20 3rd Qu.:3.282e+08 3rd Qu.:14.700
## Max. :113.10 Max. :1.398e+09 Max. :14.720
## NA's :123 NA's :123 NA's :123
## Urban_population Latitude Longitude
## Min. : 35588 Min. :-38.42 Min. :-172.10
## 1st Qu.: 55908316 1st Qu.: 20.59 1st Qu.: -95.71
## Median :270663028 Median : 37.09 Median : -51.93
## Mean :224214982 Mean : 26.63 Mean : -14.13
## 3rd Qu.:270663028 3rd Qu.: 37.09 3rd Qu.: 78.96
## Max. :842933962 Max. : 61.92 Max. : 138.25
## NA's :123 NA's :123 NA's :123
The summary shows that most numeric variables, such as subscribers and lowest_monthly_earnings, are highly right-skewed: their means lie well above their medians.
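A quick way to flag right skew is the ratio of mean to median, which is close to 1 for a symmetric variable and clearly above 1 for a right-skewed one. The helper skew_ratio below is a hypothetical name, and the demo vector reuses the five quantile values of subscribers from the summary as stand-in data, not the full sample.

```r
# Ratio of mean to median; values well above 1 suggest right skew
skew_ratio <- function(x) mean(x, na.rm = TRUE) / median(x, na.rm = TRUE)

# Stand-in data: min, quartiles and max of subscribers from the summary output
subscribers_demo <- c(12.3e6, 14.5e6, 17.7e6, 24.6e6, 245e6)
skew_ratio(subscribers_demo)

# A symmetric vector for comparison
symmetric_demo <- c(1, 2, 3, 4, 5)
skew_ratio(symmetric_demo)

# Ratios computed directly from the reported means and medians
22982412 / 17700000   # subscribers, about 1.3
36886 / 13300         # lowest_monthly_earnings, about 2.8
```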
Use the names function to list the column names of the data.
## [1] "rank"
## [2] "Youtuber"
## [3] "subscribers"
## [4] "video.views"
## [5] "category"
## [6] "Title"
## [7] "uploads"
## [8] "Country"
## [9] "Abbreviation"
## [10] "channel_type"
## [11] "video_views_rank"
## [12] "country_rank"
## [13] "channel_type_rank"
## [14] "video_views_for_the_last_30_days"
## [15] "lowest_monthly_earnings"
## [16] "highest_monthly_earnings"
## [17] "lowest_yearly_earnings"
## [18] "highest_yearly_earnings"
## [19] "subscribers_for_last_30_days"
## [20] "created_year"
## [21] "created_month"
## [22] "created_date"
## [23] "Gross.tertiary.education.enrollment...."
## [24] "Population"
## [25] "Unemployment.rate"
## [26] "Urban_population"
## [27] "Latitude"
## [28] "Longitude"
The column names do not follow a consistent format, so they need to be standardized.
Use the clean_names function from the janitor package to normalize the column names of the data frame.
## [1] "rank" "youtuber"
## [3] "subscribers" "video_views"
## [5] "category" "title"
## [7] "uploads" "country"
## [9] "abbreviation" "channel_type"
## [11] "video_views_rank" "country_rank"
## [13] "channel_type_rank" "video_views_for_the_last_30_days"
## [15] "lowest_monthly_earnings" "highest_monthly_earnings"
## [17] "lowest_yearly_earnings" "highest_yearly_earnings"
## [19] "subscribers_for_last_30_days" "created_year"
## [21] "created_month" "created_date"
## [23] "gross_tertiary_education_enrollment" "population"
## [25] "unemployment_rate" "urban_population"
## [27] "latitude" "longitude"
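To make the effect of the normalization concrete, here is a rough base-R approximation. It is not janitor's actual algorithm, and approx_clean_names is a hypothetical helper; it only reproduces the lower-case, underscore-separated result for names like the ones in this data set.

```r
# Rough approximation of the renaming (NOT the janitor implementation)
approx_clean_names <- function(x) {
  x <- gsub("[^A-Za-z0-9]+", "_", x)  # runs of other characters become one underscore
  x <- gsub("^_+|_+$", "", x)         # drop leading and trailing underscores
  tolower(x)
}

cleaned <- approx_clean_names(c("Youtuber", "video.views",
                                "Gross.tertiary.education.enrollment....",
                                "Urban_population"))
cleaned
```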
Find the invalid value 0 in the numeric variables and the missing-value strings “NaN” and “nan” in the categorical variables, and convert them to NA.
Inspecting the data reveals some invalid rows. For example, a channel may have tens of millions of subscribers while its view count and upload count are 0 and its earnings are also 0. These values were most likely withheld for commercial reasons, so the rows are distorted.
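The conversion code is not shown in the report. One possible version is sketched below on a toy data frame; the column selection in zero_is_invalid and the data values are assumptions for illustration.

```r
# Toy data with an invalid 0 row and a "nan" string
demo <- data.frame(
  video_views = c(2.28e11, 0, 2.84e10),
  uploads = c(20082, 0, 741),
  country = c("India", "nan", "United States"),
  stringsAsFactors = FALSE
)

# Numeric columns in which 0 is treated as invalid (assumed selection)
zero_is_invalid <- c("video_views", "uploads")
for (col in zero_is_invalid) {
  # which() skips values that are already NA
  demo[[col]][which(demo[[col]] == 0)] <- NA
}

# Character columns: turn the strings "nan" and "NaN" into real NA values
is_chr <- sapply(demo, is.character)
demo[is_chr] <- lapply(demo[is_chr], function(x) {
  x[x %in% c("nan", "NaN")] <- NA
  x
})
demo
```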
# Columns to check
variables_to_check <- c("video_views", "uploads", "lowest_monthly_earnings",
                        "highest_monthly_earnings", "lowest_yearly_earnings",
                        "highest_yearly_earnings")
# Find the rows in which all of these columns are NA
rows_with_all_na <- which(rowSums(is.na(youtube_data[, variables_to_check])) == length(variables_to_check))
# Delete these rows, keeping a copy of the deleted values
deleted_rows <- youtube_data[rows_with_all_na, ]
deleted_values <- deleted_rows[, variables_to_check]
youtube_data_cleaned <- youtube_data[-rows_with_all_na, ]
# Print the deleted values
print(deleted_values)
## video_views uploads lowest_monthly_earnings highest_monthly_earnings
## 6 NA NA NA NA
## 13 NA NA NA NA
## 103 NA NA NA NA
## 361 NA NA NA NA
## 593 NA NA NA NA
## lowest_yearly_earnings highest_yearly_earnings
## 6 NA NA
## 13 NA NA
## 103 NA NA
## 361 NA NA
## 593 NA NA
Inspecting the created_year column, which is used later, reveals an outlier.
ggplot(youtube_data_cleaned, aes(x = created_year)) +
  geom_boxplot(outlier.colour = "darkred", outlier.shape = 18, outlier.size = 8) +
  labs(title = "Box Plot of Created Year",
       x = "Year",
       y = "Created Year") +
  plot_theme + coord_flip()
The box plot shows an outlier at 1970. YouTube was founded in 2005, so a channel cannot have been created in 1970; this row carries no usable information and should be deleted.
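The deletion relies on the usual 1.5 × IQR rule. As a rough check of what that rule gives here, the sketch below plugs in the quartiles of created_year from the summary output (2009 and 2016). The quartiles of the cleaned data may differ slightly, so the bounds are illustrative.

```r
# Quartiles of created_year as reported in the summary output
q1_demo <- 2009
q3_demo <- 2016
iqr_demo <- q3_demo - q1_demo            # 7 years

lower_demo <- q1_demo - 1.5 * iqr_demo   # 1998.5
upper_demo <- q3_demo + 1.5 * iqr_demo   # 2026.5

# 1970 falls below the lower bound, so it is flagged as an outlier
1970 < lower_demo
```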
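q1 <- quantile(youtube_data_cleaned$created_year, 0.25, na.rm = TRUE)
q3 <- quantile(youtube_data_cleaned$created_year, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
# Keep the rows inside the bounds. Rows with a missing created_year are kept as they are,
# because indexing with an NA condition would otherwise turn them into all-NA rows
created_year_values <- youtube_data_cleaned$created_year
youtube_data_cleaned <- youtube_data_cleaned[is.na(created_year_values) |
  (created_year_values >= lower_bound & created_year_values <= upper_bound), ]
Convert all NA values in the channel type column to the string “missing”.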
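The code for this conversion is not shown in the report. A minimal sketch on a stand-in vector (channel_type_demo is a hypothetical name) could look like this:

```r
# Stand-in for the channel_type column
channel_type_demo <- c("Music", NA, "Games", NA)

# Replace every NA by the label "missing"
channel_type_demo[is.na(channel_type_demo)] <- "missing"
channel_type_demo
```

Doing this before the conversion to factors keeps “missing” as an ordinary level, so these channels still appear in counts and plots.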
To retain the sample size and the data structure and to limit bias, replace the NA values in each numeric variable with the median of that variable, using a helper function.
impute_na_with_median <- function(data) {
  for (col_name in names(data)) {
    col <- data[[col_name]]
    # Only numeric columns are imputed; other columns are left unchanged
    if (is.numeric(col)) {
      median_val <- median(col, na.rm = TRUE)
      col[is.na(col)] <- median_val
      data[[col_name]] <- col
    }
  }
  return(data)
}
youtube_data_cleaned <- impute_na_with_median(youtube_data_cleaned)
To make the subsequent analysis more convenient, delete some unnecessary columns.
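The median is a sensible fill value for skewed variables because a single extreme value barely moves it. The toy vector below, with made-up values loosely modelled on country_rank, illustrates the difference from the mean; impute_median is a hypothetical vector-level helper that follows the same idea as the imputation function.

```r
# Vector-level helper following the same idea as the imputation step
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy values with one missing entry and one extreme value
rank_demo <- c(1, NA, 2, 7670)

median(rank_demo, na.rm = TRUE)  # 2
mean(rank_demo, na.rm = TRUE)    # pulled far upwards by 7670

imputed_demo <- impute_median(rank_demo)
imputed_demo
```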
Transform all the categorical variables into factors using a function.
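The transformation function is not shown in the report. A possible version, sketched on a toy data frame, converts every character column to a factor and leaves the numeric columns alone; to_factors and demo are hypothetical names.

```r
# Convert every character column of a data frame into a factor
to_factors <- function(data) {
  for (col_name in names(data)) {
    if (is.character(data[[col_name]])) {
      data[[col_name]] <- factor(data[[col_name]])
    }
  }
  data
}

demo <- data.frame(
  channel_type = c("Music", "Entertainment", "Music"),
  subscribers = c(245e6, 166e6, 159e6),
  stringsAsFactors = FALSE
)
demo_transformed <- to_factors(demo)
str(demo_transformed)
```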
Look at the distribution of YouTube channel types.
# Count the channels of each type
channel_counts <- table(youtube_data_transformed$channel_type)
# Convert the counts into a data frame for easier manipulation
channel_counts_df <- data.frame(channel_type = names(channel_counts), count = as.numeric(channel_counts))
# Sort the data frame by count, in ascending order
sorted_channel_counts_df <- channel_counts_df[order(channel_counts_df$count, decreasing = FALSE), ]
# Use the sorted types as factor levels so that the bars appear in ascending order
ggplot(sorted_channel_counts_df, aes(x = factor(channel_type, levels = sorted_channel_counts_df$channel_type), y = count)) +
  geom_bar(stat = "identity", fill = 'grey90', color = 'darkred') +
  geom_text(aes(label = count), vjust = -0.5, color = 'darkred') +
  labs(title = "Distribution of Channel Type",
       x = "Channel Type") + plot_theme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        axis.text.y = element_text(hjust = 0.5, size = 10),
        plot.title = element_text(size = 18))
Music and entertainment channels make up the largest share, which suggests that the top channels on YouTube mainly serve relaxation and entertainment. Education channels account for a smaller share, but it is almost equal to that of comedy channels, which shows that some people use YouTube to learn.
world_map <- map_data("world")
Video_Views_by_Country <- youtube_data_transformed |>
group_by(country) |>
summarise(videoviews = sum(video_views))
map_data <- world_map |>
left_join(Video_Views_by_Country, by = c("region" = "country"), copy = TRUE)
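A join on country names only fills the countries whose names are spelled identically on both sides. The world map from the maps package uses “USA” and “UK” as region names, so rows labelled “United States” or “United Kingdom” would stay unmatched and those countries would appear unfilled. The sketch below shows one way to harmonize the names before joining; the recoding table is a hypothetical example and would need to be extended after checking which names fail to match.

```r
# Hypothetical recoding table: data spelling on the left, map region name on the right
recode_to_map_region <- c("United States" = "USA", "United Kingdom" = "UK")

# Replace the country names that have an entry in the recoding table
harmonize_country <- function(country) {
  country <- as.character(country)
  hit <- country %in% names(recode_to_map_region)
  country[hit] <- recode_to_map_region[country[hit]]
  unname(country)
}

harmonized_demo <- harmonize_country(c("India", "United States", "United Kingdom"))
harmonized_demo
```

Unmatched names can be listed with setdiff() between the country names in the data and the region names of the map.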
ui <- fluidPage(
titlePanel("Shiny App"),
verticalLayout(
wellPanel(
selectInput("plot_type", "Select Plot Type:",
choices = c("scatter_plot", "density_plot", "map")),
# Show different widgets depending on the selected plot type
conditionalPanel(
condition = "input.plot_type == 'scatter_plot'",
verticalLayout(
selectInput("numeric_var", "Select Numeric Variable:",
choices = c("subscribers", "video_views")),
plotOutput("scatter_plot", width = "800px", height = "600px")
)
),
conditionalPanel(
condition = "input.plot_type == 'density_plot'",
verticalLayout(
radioButtons("earnings_var", "Select Earnings Variable:",
choices = c("monthly_earnings", "yearly_earnings")),
plotOutput("density_plot", width = "800px", height = "400px")
)
),
conditionalPanel(
condition = "input.plot_type == 'map'",
verticalLayout(
sliderInput(inputId = "slider_range", label = "Video Views Range",
min = 0, max = max(youtube_data_transformed$video_views),
value = c(0, max(youtube_data_transformed$video_views))),
plotOutput("my_map", width = "800px", height = "600px")
)
)
)
)
)
server <- function(input, output) {
# Scatter plot
output$scatter_plot <- renderPlot({
x_var <- switch(input$numeric_var,
"subscribers" = "subscribers",
"video_views" = "video_views")
ggplot(youtube_data_transformed, aes_string(x = x_var, y = "channel_type", color = "channel_type")) +
geom_point() +
ggtitle(paste("Relationship between", input$numeric_var, "and Channel Type")) +
ylab("Channel Type") +
xlab(input$numeric_var) +
theme_minimal() + plot_theme +
theme(axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
plot.title = element_text(size = 17,hjust = 0.5),
legend.key.size = unit(1.2, "cm"),legend.text = element_text(size = 12),legend.title = element_text( size = 13))
})
# Density plot
output$density_plot <- renderPlot({
earnings_var <- input$earnings_var
if (earnings_var == "monthly_earnings") {
p <- ggplot(data = youtube_data_transformed) +
geom_density(aes(x = log(lowest_monthly_earnings), fill = "Lowest Monthly Earnings"), alpha = 0.5) +
geom_density(aes(x = log(highest_monthly_earnings), fill = "Highest Monthly Earnings"), alpha = 0.5) +
scale_x_log10() +
ggtitle("Monthly Earnings") +
xlab("Earnings(log)") +
ylab("Density") +
labs(fill = "Variable") +
plot_theme +
theme(axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
plot.title = element_text(size = 17,hjust = 0.5),
legend.key.size = unit(1.2, "cm"),legend.text = element_text(size = 12),legend.title = element_text( size = 13))
p + annotate("text", x = max(p$data$x), y = 0, label = "Lowest Monthly Earnings", vjust = 0, hjust = 1, color = "lightblue") +
annotate("text", x = max(p$data$x), y = -0.02, label = "Highest Monthly Earnings", vjust = 0, hjust = 1, color = "darkred")
} else if (earnings_var == "yearly_earnings") {
p <- ggplot(data = youtube_data_transformed) +
geom_density(aes(x = log(lowest_yearly_earnings), fill = "Lowest Yearly Earnings"), alpha = 0.5) +
geom_density(aes(x = log(highest_yearly_earnings), fill = "Highest Yearly Earnings"), alpha = 0.5) +
scale_x_log10() +
ggtitle("Yearly Earnings") +
xlab("Earnings(log)") +
ylab("Density") +
labs(fill = "Variable") +
plot_theme +
theme(axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
plot.title = element_text(size = 17,hjust = 0.5),
legend.key.size = unit(1.2, "cm"),legend.text = element_text(size = 12),legend.title = element_text( size = 13))
p + annotate("text", x = max(p$data$x), y = 0, label = "Lowest Yearly Earnings", vjust = 0, hjust = 1, color = "lightblue") +
annotate("text", x = max(p$data$x), y = -0.02, label = "Highest Yearly Earnings", vjust = 0, hjust = 1, color = "darkred")
}
})
  # Map
  filtered_data <- reactive({
    # Keep the channels inside the selected range, then total their views per country
    youtube_data_transformed |>
      dplyr::filter(video_views >= input$slider_range[1] & video_views <= input$slider_range[2]) |>
      group_by(country) |>
      summarise(video_views = sum(video_views))
  })
  output$my_map <- renderPlot({
    # One row per country, so every map polygon gets exactly one fill value
    filtered_map_data <- left_join(world_map, filtered_data(), by = c("region" = "country"))
    ggplot(data = filtered_map_data) +
      geom_polygon(aes(x = long, y = lat, group = group, fill = video_views / 1e9), color = "white") +
      scale_fill_gradient(name = "Video Views (billions)", low = "lightgrey", high = "darkred", labels = scales::comma) +
      coord_fixed(ratio = 1.3) +
      labs(title = "Video Views of Countries") +
      scale_x_continuous(labels = scales::comma) +
      scale_y_continuous(labels = scales::comma) +
      theme_minimal() +
      theme(legend.position = "bottom",
            plot.title = element_text(size = 20, hjust = 0.5, color = "darkred"),
            legend.key.size = unit(1.2, "cm"),
            legend.text = element_text(size = 12),
            legend.title = element_text(size = 13))
  })
}
shinyApp(ui, server)
The scatter plots show that subscribers and views are distributed across channel types in roughly the same way. For subscribers, music, entertainment, games, and education channels stand out, and the single most-subscribed channel is a music channel with 245 million subscribers. For views, entertainment, music, and education channels lead, and the most-viewed channel is also a music channel.
Monthly and yearly earnings follow a similar pattern. Each density curve has a long tail at low earnings and then rises steeply to its peak. The peak of the lowest-earnings curve is always lower than the peak of the highest-earnings curve.
Moving the slider shows the video views within a chosen interval and their distribution across countries. Countries with high video views are mainly India and Russia, whose areas are shaded darkest.
Because there are many countries and many of them have very few channels, only the 100 most frequent combinations of country and channel type are extracted for analysis.
new_youtube_data <- data.frame(
Country = youtube_data_transformed$country,
channel_type = youtube_data_transformed$channel_type
)
counting <- table(new_youtube_data$Country, new_youtube_data$channel_type)
counting_df <- as.data.frame(counting)
names(counting_df) <- c("Country", "Channel_type", "Count")
top_countries <- counting_df |>
arrange(desc(Count)) |>
head(100)
ggplot(data = top_countries, mapping = aes(x = Country, y = Channel_type)) +
  # shape = 21 is a filled point shape, so the fill gradient becomes visible
  geom_count(aes(size = Count, fill = Count), shape = 21) +
scale_size_continuous(range = c(5, 20)) +
scale_fill_gradient(low = "lightblue", high = "darkred") +
ggtitle("Top 100 Countries vs. Channel Type") +
labs(x = "Country", y = "Channel Type", fill = "Count") +
plot_theme + coord_flip() +
theme(panel.background = element_rect(fill = "gray"),
panel.grid.major = element_line(color = "white", linetype = 1),
axis.text = element_text(size = 30,hjust = 1),
axis.text.x = element_text(angle = 45),
axis.title = element_text(size = 35),
plot.title = element_text(size = 45),
legend.key.size = unit(3, "cm"),
legend.title = element_text(size = 25),
        legend.text = element_text(size = 25))
Most of these 100 combinations belong to entertainment and music channels. Channels from the United States and India cover the widest range of channel types.
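Taking the first rows of the sorted count table selects the most frequent pairs of country and channel type, which is not the same as selecting the top countries. The toy example below, with made-up rows, shows both computations side by side.

```r
# Toy data: one row per channel
demo <- data.frame(
  country = c("India", "India", "United States", "United States", "United States", "Brazil"),
  channel_type = c("Music", "Music", "Music", "Games", "Games", "Music")
)

# Count every (country, channel type) pair, including pairs that never occur
pair_counts <- as.data.frame(table(demo$country, demo$channel_type))
names(pair_counts) <- c("Country", "Channel_type", "Count")
pair_counts <- pair_counts[order(-pair_counts$Count), ]

# The most frequent pairs
top_pairs <- head(pair_counts, 2)
top_pairs

# The countries with the most channels overall
country_counts <- sort(table(demo$country), decreasing = TRUE)
country_counts
```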
custom_colors <- c("#1f78b4", "#33a02c", "#e31a1c", "#ff7f00", "#6a3d9a",
"#b15928", "#a6cee3", "#b2df8a", "#fb9a99", "#fdbf6f",
"#cab2d6", "#ffff99", "#8dd3c7", "#bebada", "#80b1d3")
ggplot(youtube_data_transformed, aes(x = channel_type, y = subscribers_for_last_30_days, fill = channel_type)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = custom_colors) +
labs(title = "Subscribers for Last 30 Days by Channel Type",
x = "Channel Type",
y = "Numbers of Subscribers") +
theme(axis.text = element_text(size = 25),
axis.title = element_text(size = 35),
legend.key.size = unit(3, "cm"),
legend.text = element_text(size = 30),
legend.title = element_text(size = 30),
plot.title = element_text(size = 45,hjust = 0.5,color = "darkred"))+
  coord_polar()
As with the overall number of subscribers, entertainment and music channels gained the most subscribers in the last 30 days. People channels rank just behind music channels.
date_info <- data.frame(year = youtube_data_transformed$created_year,
month = youtube_data_transformed$created_month)
date_info <- na.omit(date_info)
date_info <- date_info[order(date_info$year, date_info$month), ]
# Order the months by calendar (Jan to Dec) instead of alphabetically
ggplot(date_info, aes(x = factor(year), fill = factor(month, levels = month.abb))) +
geom_bar() +
scale_x_discrete(labels = unique(date_info$year), expand = c(0.05, 0)) +
labs(title = "Distribution of creation time",
x = "Year",
y = "Count",
fill = "Month") +
plot_theme +
theme(axis.text = element_text(size = 20),
axis.title = element_text(size = 23),
plot.title = element_text(size = 30),
legend.key.size = unit(1.5, "cm"),
legend.text = element_text(size = 15),
legend.title = element_text( size = 15))+
  scale_fill_brewer(palette = "Set3")
The bar chart shows that far more channels were created in 2006, 2011 and 2014 than in other years, and 2014 is the last year of large-scale growth. After 2014 the number of channels created tends to decline, and any increases are small.
ggplot(data = youtube_data_transformed) +
geom_point(mapping = aes(y = video_views, x = created_year), color = "darkred",shape = 18,size = 4)+
geom_smooth(mapping = aes(y = video_views, x = created_year))+
labs(title = "Created year vs. Video Views")+plot_theme +
theme(plot.title = element_text(size = 20),
axis.title = element_text(size = 15),
        axis.text = element_text(size = 13))
Video views follow the same trend as channel creation. 2006 was not only a year in which the number of created channels rose sharply, but also the creation year of the channels with the most views. For channels created later, views stay at a fairly stable level, with small rises and falls.
count_data <- youtube_data_transformed |>
group_by(created_year, channel_type) |>
count()
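p <- ggplot(count_data, aes(x = created_year, y = n, color = channel_type, group = channel_type,
                            text = paste("Channel Type: ", channel_type, "\nYear: ", created_year, "\nCount: ", n))) +
  # group = channel_type draws one line per type; without it, the text aesthetic
  # puts every point into its own group and no lines are drawn
  geom_line() +
  geom_point(size = 3, alpha = 0.6) +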
labs(title = " Created Year vs. Count by Channel Type",
x = "Created Year",
y = "Count",
color = "Channel Type")+
theme_minimal() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 11),
plot.title = element_text(size = 17,hjust = 0.5,color = "darkred"),
legend.key.size = unit(0.7, "cm"),
legend.text = element_text(size = 10),
legend.title = element_text( size = 13))
ggplotly(p, tooltip = "text")
Moving the cursor over a point shows the channel type, the year, and the number of channels of that type created in that year. The plot shows that, in each year, entertainment is the type with the most channels created. It also shows that not every channel type has channels created in every year.
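The gaps in the lines arise because grouped counting only returns the combinations of year and type that actually occur. A base-R sketch on made-up rows makes the missing combinations visible: table() lists every combination, including those with a count of 0.

```r
# Toy data: one row per channel
demo <- data.frame(
  created_year = c(2006, 2006, 2011, 2014, 2014, 2014),
  channel_type = c("Music", "Music", "Games", "Music", "Games", "Games")
)

# Cross-tabulate year and type; absent combinations get a count of 0
year_type_counts <- as.data.frame(table(demo$created_year, demo$channel_type),
                                  stringsAsFactors = FALSE)
names(year_type_counts) <- c("created_year", "channel_type", "n")

# Combinations that occur in the data (what grouped counting would return)
observed <- subset(year_type_counts, n > 0)
# Combinations that never occur (the gaps in the plot)
absent <- subset(year_type_counts, n == 0)
absent
```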
ggplot(youtube_data_transformed, aes(x = created_month, y = channel_type, color = channel_type)) +
geom_point() +
geom_line(aes(group = channel_type)) +
labs(title = "Created Month vs. Channel Type",
x = "Created Month",
y = "Channel Type",
color = "Channel Type")+
theme_minimal() +
theme(axis.title = element_text(size = 20),
axis.text = element_text(size = 16),
plot.title = element_text(size = 27,hjust = 0.5,color = "darkred"),
legend.key.size = unit(1, "cm"),
legend.text = element_text(size = 16),
        legend.title = element_text(size = 19))
Channels of almost every type were created in almost every month of the year. The exceptions are the Nonprofit and Autos types, whose channels were created in only a few months.
The Global YouTube Statistics data set from the Kaggle platform has 995 observations and 28 variables. It contains missing and invalid values, which were processed according to the needs of each analysis so that the data could be used.
In every analysis involving channel type, whether of the number of channels per type or of the relationship with views or subscribers, entertainment is the type that stands out most.
The earnings variables show a clear pattern: the top creators earn a great deal, while smaller creators earn little. Whether earnings are related to when an account was created is worth further exploration.
For the relationship between views and account creation time, there appears to be an early bonus period; after it, views still increase, but only slightly. It is reasonable to speculate that the earlier an account is created, the more likely it is to become a top channel.
Holtz, Y. (n.d.). The R Graph Gallery – Help and inspiration for R charts. The R Graph Gallery. https://r-graph-gallery.com/
OpenAI. (2023). ChatGPT (August 3 Version) [Large language model]. https://chat.openai.com